Skip to content

fix(pool): probe liveness before reusing session-pinned WS upstream#3021

Closed
thiscantbeserious wants to merge 1 commit intomaximhq:mainfrom
thiscantbeserious:fix/3002-pool-stale-session-conn
Closed

fix(pool): probe liveness before reusing session-pinned WS upstream#3021
thiscantbeserious wants to merge 1 commit intomaximhq:mainfrom
thiscantbeserious:fix/3002-pool-stale-session-conn

Conversation

@thiscantbeserious
Copy link
Copy Markdown

Summary

After the upstream WebSocket server sends its terminal event and then closes the connection (e.g. with code 1011 on keepalive timeout), Bifrost would hand out the same connection to the next request on the same client session. The connection appeared live (IsClosed() returned false, age was within limits) because the WS close frame sat buffered in the socket and had never been read. The next caller would succeed on WriteMessage (OS buffer accepts the write) and then immediately fail on the first ReadMessage when the buffered close frame was returned as an error. This caused a fallback to the HTTP bridge and an error frame to the client.

The root cause is that Pool.Get relied only on an atomic bool (set by Bifrost's own Close call) and wall-clock expiry checks to determine connection health. Neither of those detects a server-side close that Bifrost never consumed.

Changes

  • transports/bifrost-http/websocket/pool.go: added (*Pool).isLive helper that sets a 1ms read deadline on an idle connection and calls ReadMessage. A timeout error (no queued data) confirms the connection is alive. A close frame, EOF, or any other error reveals it is stale. The probe runs in Pool.Get after the existing IsClosed / isExpired checks, before incrementing inFlight and returning the connection to the caller. Stale connections are closed and the loop continues to the next idle candidate or a fresh dial.
  • transports/bifrost-http/websocket/pool_test.go: added TestPoolGetEvictsStaleSessionConn. A mock WS server accepts one connection, then closes it server-side after the test returns the connection to the pool. Pool.Get is called again and must produce a fresh connection (new dial), not the stale one.

Type of change

  • Bug fix

Affected areas

  • Transports (HTTP)

How to test

go test ./transports/bifrost-http/websocket/... -cover -run TestPoolGetEvictsStaleSessionConn -v
go test ./transports/bifrost-http/websocket/... -cover
go vet ./transports/bifrost-http/...

Expected: TestPoolGetEvictsStaleSessionConn passes, all existing tests pass, no vet warnings.

Integration test (manual): configure a WS upstream that closes connections after the terminal event. Send two sequential requests on the same Bifrost WS session with a pause between them. Before this fix, request 2 would receive an error frame. After this fix, request 2 succeeds with a fresh upstream connection.

Breaking changes

  • No

Related issues

Closes #3002

This fix was discovered while working on PR #2775 (openai-oauth-passthrough). The feature branch in that PR contains an earlier version of the liveness probe, which has been extracted here and removed from the feature branch in a separate commit.

Security considerations

None. The probe is a local read with a 1ms deadline on a connection that Bifrost owns exclusively while it is in the idle pool. No external input is involved and no credentials are affected.

Checklist

  • I read docs/contributing/README.md and followed the guidelines
  • I added/updated tests where appropriate
  • I updated documentation where needed
  • I verified builds succeed (Go and UI)
  • I verified the CI pipeline passes locally if applicable

…q#3002)

Pool.Get now performs a 1ms read-deadline probe on each idle connection
before handing it out. A timeout error (no queued data) confirms the
connection is alive. A close frame, EOF, or any other error reveals that
the upstream has closed the connection server-side; the stale entry is
discarded and a fresh dial is performed. This closes the window where a
session-pinned upstream connection appeared live (IsClosed() == false,
age within limits) but carried a buffered close frame from a previous
response, causing the next request to fail immediately on first read.

Adds TestPoolGetEvictsStaleSessionConn: mock WS server that closes the
connection after delivering its response; asserts Pool.Get dials fresh.

Closes maximhq#3002

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
@coderabbitai
Copy link
Copy Markdown
Contributor

coderabbitai Bot commented Apr 24, 2026

Important

Review skipped

Draft detected.

Please check the settings in the CodeRabbit UI or the .coderabbit.yaml file in this repository. To trigger a single review, invoke the @coderabbitai review command.

⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

Run ID: 81eca13b-013e-467b-a632-961f44791f73

You can disable this status message by setting the reviews.review_status to false in the CodeRabbit configuration file.

Use the checkbox below for a quick retry:

  • 🔍 Trigger review
✨ Finishing Touches
🧪 Generate unit tests (beta)
  • Create PR with unit tests

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

Comment @coderabbitai help to get the list of available commands and usage tips.

thiscantbeserious added a commit to thiscantbeserious/bifrost that referenced this pull request Apr 24, 2026
…ure-branch duplicate)

The liveness probe for stale session-pinned WS pool connections (maximhq#3002)
was implemented directly on fix/3002-pool-stale-session-conn off main.
The feature branch never contained a parallel implementation of this probe,
so there is no code to remove. This commit marks the extraction as complete
and links the focused fix to this branch for traceability.

Related: maximhq#3021, maximhq#3002

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
@thiscantbeserious
Copy link
Copy Markdown
Author

Superseded by #3018 (merged 2026-04-24), which bundles the fix for this issue and several other native WS reliability bugs. Closing this draft as redundant.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

[Bug]: WebSocket pool hands out stale session-pinned upstream connections after server-side close

1 participant